Abstract


Graphical models, such as Bayesian networks and Markov Random Fields,
represent statistical dependencies of variables by a graph. Local “belief
propagation” rules of the sort proposed by Pearl [18] are guaranteed to
converge to the correct posterior probabilities in singly connected graphical
models. Recently, a number of researchers have empirically demonstrated good
performance of “loopy belief propagation”, that is, using these same rules on
graphs with loops. Perhaps the most dramatic instance is the near
Shannon-limit performance of “Turbo codes”, whose decoding algorithm is
equivalent to loopy belief propagation.

Except for the case of graphs with a single loop, there has been little
theoretical understanding of the performance of loopy propagation. Here we
analyze belief propagation in networks with arbitrary topologies when the
nodes in the graph describe jointly Gaussian random variables. We give an
analytical formula relating the true posterior probabilities with those
calculated using loopy propagation. We give sufficient conditions for
convergence and show that when belief propagation converges it gives the
correct posterior means for all graph topologies, not just networks with a
single loop.

The related “max-product” belief propagation algorithm finds the maximum
posterior probability estimate for singly connected networks. We show that,
even for non-Gaussian probability distributions, the convergence points of the
max-product algorithm in loopy networks are at least local maxima of the
posterior probability.

These results motivate using the powerful belief propagation algorithm in a
broader class of networks, and help clarify the empirical performance results.


YW  is  supported  by  MURI-ARO-DAAH04-96-1-0341 


Problems involving probabilistic belief propagation arise in a wide variety of
applications, including error correcting codes, speech recognition and medical
diagnosis. Typically, a probability distribution is assumed over a set of
variables and the task is to infer the values of the unobserved variables given
the observed ones. The assumed probability distribution is described using a
graphical model [13]: the qualitative aspects of the distribution are specified
by a graph structure. The graph may either be directed as in a Bayesian network
[18, 11] or undirected as in a Markov Random Field [18, 10]. Different
communities tend to prefer different graph formalisms (see [19] for a recent
review): directed graphs are more common in AI, medical diagnosis and
statistics while undirected graphs are more common in image processing,
statistical physics and error correcting codes. In this paper we use the
undirected graph formulation because one can always perform inference on a
directed graph by converting it to an equivalent undirected graph.

If  the  graph  is  singly  connected  (i.e.  there  is  only  one  path  between  any  two  given 
nodes)  then  there  exist  efficient  local  message-passing  schemes  to  calculate  the 
posterior  probability  of  an  unobserved  variable  given  the  observed  variables.  Pearl 
(1988)  derived  such  a  scheme  for  singly  connected  Bayesian  networks  and  showed 
that  this  “belief  propagation”  algorithm  is  guaranteed  to  converge  to  the  correct 
posterior  probabilities  (or  “beliefs”).  However,  as  Pearl  noted,  the  same  algorithm 
will  not  give  the  correct  beliefs  for  multiply  connected  networks: 

When loops are present, the network is no longer singly connected
and local propagation schemes will invariably run into trouble . . .

If we ignore the existence of loops and permit the nodes to continue
communicating with each other as if the network were singly connected,
messages may circulate indefinitely around the loops and the process
may not converge to a stable equilibrium . . . Such oscillations do not
normally occur in probabilistic networks . . . which tend to bring all
messages to some stable equilibrium as time goes on. However, this
asymptotic equilibrium is not coherent, in the sense that it does not
represent the posterior probabilities of all nodes of the network
(Pearl 1988, p. 195)

Despite these reservations, Pearl advocated the use of belief propagation in
loopy networks as an approximation scheme (J. Pearl, personal communication)
and one of the exercises in [18] investigates the quality of the approximation
when it is applied to a particular loopy belief network.

Several groups have recently reported excellent experimental results by using
this approximation scheme, that is, by running algorithms equivalent to
Pearl’s algorithm on networks with loops [8, 17, 7]. Perhaps the most dramatic
instance of this performance is in an error correcting code scheme known as
“Turbo codes” [3]. These codes have been described as “the most exciting and
potentially important development in coding theory in many years” [16] and
have recently been shown [12, 15] to utilize an algorithm equivalent to belief
propagation in a network with loops. Although there is widespread agreement in
the coding community that these codes “represent a genuine, and perhaps
historic, breakthrough” [16], a theoretical understanding of their performance
has yet to be achieved.

Progress in the analysis of loopy belief propagation has been made for the case
of networks with a single loop [22, 23, 6, 2]. For these networks, it can be
shown that:


•  Unless  all  the  compatibilities  are  deterministic,  loopy  belief  propagation 
will  converge. 

•  An  analytic  expression  relates  the  correct  marginals  to  the  loopy  marginals. 
The  approximation  error  is  related  to  the  convergence  rate  of  the  messages 
—  the  faster  the  convergence  the  more  exact  the  approximation. 

•  If the hidden nodes are binary, then the loopy beliefs and the true beliefs
are both maximized by the same assignments, although the confidence in
that assignment is wrong for the loopy beliefs.

In  this  paper  we  analyze  belief  propagation  in  graphs  of  arbitrary  topology  but 
focus  primarily  on  nodes  that  describe  jointly  Gaussian  random  variables.  We  give 
an  exact  formula  that  relates  the  correct  marginal  posterior  probabilities  with  the 
ones  calculated  using  loopy  belief  propagation.  We  show  that  if  belief  propagation 
converges  then  it  will  give  the  correct  posterior  means  for  all  graph  topologies,  not 
just  networks  with  a  single  loop.  The  covariance  estimates  will  generally  be  incorrect 
but  we  present  a  relationship  between  the  error  in  the  covariance  estimates  and  the 
convergence  speed.  For  Gaussian  or  non-Gaussian  variables,  we  show  that  the 
“max-product”  algorithm,  which  calculates  the  MAP  estimate  in  singly  connected 
networks,  only  converges  to  points  that  are  at  least  local  maxima  of  the  posterior 
probability  of  loopy  networks.  This  motivates  using  this  powerful  algorithm  in  a 
broader  class  of  networks. 

1  Belief  propagation  in  undirected  graphical  models 

An undirected graphical model (or a Markov Random Field) is a graph in which
the nodes represent variables and arcs represent compatibility constraints
between them. Assuming all probabilities are nonzero, the Hammersley-Clifford
theorem (e.g. [18]) guarantees that the probability distribution will factorize
into a product of functions of the maximal cliques of the graph.

Denoting by $x$ the values of all variables in the graph, the factorization has
the form:

$$P(x) = \prod_c \Psi(x_c) \tag{1}$$

where $x_c$ is a subset of $x$ that forms a clique in the graph.

We will assume, without loss of generality, that each $x_i$ node has a
corresponding $y_i$ node that is connected only to $x_i$.

Thus:

$$P(x, y) = \prod_c \Psi(x_c) \prod_i \Psi(x_i, y_i) \tag{2}$$

The restriction that all the $y_i$ variables are observed and none of the $x_i$
variables are is just to make the notation simple: $\Psi(x_i, y_i)$ may be
independent of $y_i$ (equivalent to $y_i$ being unobserved) or
$\Psi(x_i, y_i)$ may be $\delta(x_i - x_o)$ (equivalent to $x_i$ being
observed, with value $x_o$).


Figure 1: Any Bayesian network can be converted into an undirected graph with
pairwise cliques by adding cluster nodes for all parents that share a common
child. a. A Bayesian network. b. The corresponding undirected graph with
pairwise cliques. A cluster node for (B, C) has been added. The potentials can
be set so that the joint probability in the undirected network is identical to
that in the Bayesian network. In this case the update rules presented in this
paper reduce to Pearl’s propagation rules in the original Bayesian network [23].


In  describing  and  analyzing  belief  propagation  we  assume  the  graphical  model  has 
been  preprocessed  so  that  all  the  maximal  cliques  consist  of  pairs  of  units.  Any 
graphical  model  can  be  converted  into  this  form  before  doing  inference  through  a 
suitable  clustering  of  nodes  into  large  nodes  [23].  Figure  1  shows  an  example  of 
such  a  conversion. 

Equation 2 becomes

$$P(x, y) = \prod_c \Psi(x_{c_1}, x_{c_2}) \prod_i \Psi(x_i, y_i) \tag{3}$$

Here each clique $c$ corresponds to an edge in the graph and $x_{c_1}$,
$x_{c_2}$ refer to the two nodes connected by that edge.

The advantage of preprocessing the graph into one with pairwise cliques is that
the description and the analysis of belief propagation become simpler. For
completeness, we review the belief propagation scheme used in [23].

At every iteration, each node sends a (different) message to each of its
neighbors and receives a message from each neighbor. Let $V$ and $W$ be two
neighboring nodes in the graph. We denote by $m_{VW}(w)$ the message that node
$V$ sends to node $W$; $w$ is the vector-valued random variable at node $W$.
We denote by $b_V(v)$ the belief at node $V$.

The belief update (or “sum-product” update) rules are:

$$m_{VW}(w) \leftarrow \alpha \int_v \Psi(V = v, W = w) \prod_{Z \in N(V) \setminus W} m_{ZV}(v) \, dv \tag{4}$$

$$b_V(v) \leftarrow \alpha \prod_{Z \in N(V)} m_{ZV}(v) \tag{5}$$

where $\alpha$ denotes a normalization constant and $N(V) \setminus W$ means
all nodes neighboring $V$, except $W$.

The procedure is initialized with all message vectors set to constant
functions. Observed nodes do not receive messages and they always transmit the
same vector: if $Y$ is observed to be in state $y$ then
$m_{YX}(x) = \Psi(Y = y, X = x)$. The normalization of $m_{VW}$ in equation 4
is not necessary: whether or not the messages are normalized, the belief $b_V$
will be identical. However, normalizing the messages avoids numerical underflow
and adds to the stability of the algorithm. We assume throughout this paper
that all nodes simultaneously update their messages in parallel.

It is easy to show that for singly connected graphs these updates will converge
in a number of iterations equal to the diameter of the graph and the beliefs
are guaranteed to give the correct posterior marginals:
$b_V(v) = P(V = v \mid O)$ where $O$ denotes the set of observed variables.

This message passing scheme is equivalent to Pearl’s belief propagation in
directed graphs of arbitrary clique size: for every message passed in this
scheme there exists a corresponding message in Pearl’s algorithm when the
directed graph is converted to an undirected graph with pairwise cliques [23].
For particular graphs with particular settings of the potentials, Eqs. 4-5
yield other well-known Bayesian inference algorithms, such as the
forward-backward algorithm in Hidden Markov Models, the Kalman Filter and even
the Fast Fourier Transform [1, 12].

A related algorithm, “max-product”, changes the integration in equation 4 to a
maximization. This message-passing is equivalent to Pearl’s “belief revision”
algorithm in directed graphs. For particular graphs with particular settings of
the potentials, the max-product algorithm is equivalent to the Viterbi
algorithm for Hidden Markov Models, and concurrent dynamic programming. We
define the max-product assignment at each node to be the value that maximizes
its belief (assuming a unique maximizing value exists). For singly connected
graphs, the max-product assignment is guaranteed to give the MAP assignment.
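The update rules of Eqs. 4-5, and the max-product variant obtained by swapping
the sum for a maximization, can be illustrated on a small singly connected
graph with discrete nodes. The following is a minimal sketch, not the
implementation used in the paper: the function names, the 5-node tree, the
random potentials and the array `local_pot` (standing in for the fixed messages
sent by the observed nodes $y_i$) are all assumptions made for illustration.

```python
import itertools
import numpy as np

def belief_propagation(n_nodes, edges, pair_pot, local_pot, n_iters, combine=np.sum):
    """Parallel BP on a pairwise MRF with discrete hidden nodes.

    pair_pot[(v, w)][a, b] plays the role of Psi(V=a, W=b); local_pot[v] is the
    fixed message from the observed node attached to v. combine=np.sum gives
    sum-product (Eq. 4), combine=np.max gives max-product.
    """
    n_states = local_pot.shape[1]
    nbrs = {v: [] for v in range(n_nodes)}
    psi = {}
    for (v, w) in edges:
        nbrs[v].append(w)
        nbrs[w].append(v)
        psi[(v, w)] = pair_pot[(v, w)]
        psi[(w, v)] = pair_pot[(v, w)].T
    # All messages start as constant functions.
    msgs = {key: np.ones(n_states) / n_states for key in psi}
    for _ in range(n_iters):
        new = {}
        for (v, w) in msgs:
            incoming = local_pot[v].copy()
            for z in nbrs[v]:
                if z != w:
                    incoming = incoming * msgs[(z, v)]
            m = combine(psi[(v, w)] * incoming[:, None], axis=0)
            new[(v, w)] = m / m.sum()  # normalization is optional, see text
        msgs = new
    beliefs = local_pot.copy()
    for v in range(n_nodes):
        for z in nbrs[v]:
            beliefs[v] = beliefs[v] * msgs[(z, v)]  # Eq. 5
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Hypothetical tree (diameter 3) with 3 states per node and random potentials.
rng = np.random.default_rng(0)
n_nodes, n_states = 5, 3
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
pair_pot = {e: rng.uniform(0.2, 1.0, size=(n_states, n_states)) for e in edges}
local_pot = rng.uniform(0.2, 1.0, size=(n_nodes, n_states))

# Brute-force joint posterior for comparison.
joint = np.zeros((n_states,) * n_nodes)
for x in itertools.product(range(n_states), repeat=n_nodes):
    p = np.prod([local_pot[i, x[i]] for i in range(n_nodes)])
    for (v, w) in edges:
        p *= pair_pot[(v, w)][x[v], x[w]]
    joint[x] = p
joint /= joint.sum()
marginals = np.array([
    joint.sum(axis=tuple(a for a in range(n_nodes) if a != v))
    for v in range(n_nodes)
])
map_assignment = np.array(np.unravel_index(np.argmax(joint), joint.shape))

sum_beliefs = belief_propagation(n_nodes, edges, pair_pot, local_pot, n_iters=5)
max_beliefs = belief_propagation(n_nodes, edges, pair_pot, local_pot, n_iters=5,
                                 combine=np.max)
max_product_assignment = max_beliefs.argmax(axis=1)
```

On this tree the sum-product beliefs should agree with the brute-force
marginals and the max-product assignment with the brute-force MAP assignment,
provided the number of iterations is at least the diameter of the graph.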

1.1  Gaussian  Markov  Random  Fields 

A Gaussian MRF (GMRF) is an MRF in which the joint distribution is Gaussian.
We assume, without loss of generality, that the joint mean is zero (the means
can be added in later), so the joint probability, $P(x)$, is

$$P(x) = \alpha e^{-\frac{1}{2} x^T V_{xx} x} \tag{6}$$

where $V_{xx}$ is the inverse covariance matrix and $\alpha$ denotes a
normalization constant. The MRF properties guarantee that $V_{xx}(i, j) = 0$ if
$x_i$ is not a neighbor of $x_j$. It is straightforward to write the inverse
covariance matrix describing the GMRF which respects the statistical
dependencies within the graphical model [4].


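Note that when we expand the term in the exponent we will only get terms of the
form $V_{xx}(i, j)\, x(i)\, x(j)$. Thus there exists a set of matrices $V_c$,
one corresponding to each pairwise clique, such that:

$$P(x) = \alpha \prod_c e^{-\frac{1}{2} x_c^T V_c x_c} \tag{7}$$

So that for any pair of nodes
$\Psi(x_{c_1}, x_{c_2}) = e^{-\frac{1}{2} x_c^T V_c x_c}$.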


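1.2  Inference in Gaussian MRFs

The joint distribution of $z = \begin{pmatrix} x \\ y \end{pmatrix}$ is given
by:

$$P(z) = \alpha e^{-\frac{1}{2} z^T V z} \tag{8}$$

where $V$ has the following structure

$$V = \begin{pmatrix} V_{xx} & V_{xy} \\ V_{yx} & V_{yy} \end{pmatrix} \tag{9}$$

Note that because $x_i$ is only connected to $y_i$, $V_{xy}$ is zero everywhere
except the diagonals.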

The marginalization formulas for Gaussians allow us to compute the conditional
mean of $x$ given the observations $y$. Writing out the exponent of Eq. 8 and
completing the square shows that the mean $\mu$ of $x$, given $y$, is a
solution to:

$$V_{xx} \mu = -V_{xy} y \tag{10}$$

and the covariance matrix $C_{x|y}$ of $x$ given $y$ is:

$$C_{x|y} = V_{xx}^{-1} \tag{11}$$

We will denote by $C_{x_i|y}$ the $i$th row of $C_{x|y}$, so the marginal
posterior variance of $x_i$, given the data, is $C_{x_i|y}(i)$.
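Eqs. 10 and 11 can be checked numerically against the more familiar
covariance-form conditioning formulas. The sketch below assumes a small
hypothetical model: the dimension, the random $V_{xx}$ and the diagonal
$V_{xy}$ are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Hypothetical inverse covariance V over z = (x, y), positive definite by
# construction, with a diagonal V_xy as in the text.
B = rng.normal(size=(n, n))
Vxx = B @ B.T + n * np.eye(n)
Vxy = -0.5 * np.eye(n)
Vyy = np.eye(n)
V = np.block([[Vxx, Vxy], [Vxy.T, Vyy]])
y = rng.normal(size=n)

# Eq. 10 and Eq. 11: posterior mean and covariance from the inverse covariance.
mu = np.linalg.solve(Vxx, -Vxy @ y)
C = np.linalg.inv(Vxx)

# Same quantities from the joint covariance Sigma = V^{-1} (standard
# Gaussian conditioning for a zero-mean joint distribution).
Sigma = np.linalg.inv(V)
Sxx, Sxy, Syy = Sigma[:n, :n], Sigma[:n, n:], Sigma[n:, n:]
mu_check = Sxy @ np.linalg.solve(Syy, y)
C_check = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)

# Marginal posterior variance of each x_i is the i-th diagonal entry of C.
posterior_variances = np.diag(C)
```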

Belief propagation in Gaussian MRFs gives simpler update formulas than the
general purpose case (Eqs. 4 and 5). The messages and the beliefs are all
Gaussians and the updates can be written directly in terms of the means and
covariance matrices. Each node sends and receives a mean vector and covariance
matrix to and from each neighbor, in general each different. The beliefs at a
node are calculated by combining the means and covariances of all the incoming
messages. For scalar nodes, the beliefs are a weighted average of the incoming
messages, inversely weighted by their variance.
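For scalar nodes the Gaussian updates can be written compactly if each message
is stored as a precision (inverse variance) and a precision-weighted mean. The
following sketch uses that information-form parametrization, which is one
common way to write the updates and is an assumption here rather than the
paper's own notation; `A` stands for $V_{xx}$, `b` for $-V_{xy} y$, and the
4-node graph with two loops is a hypothetical example.

```python
import numpy as np

def gaussian_bp(A, b, n_iters):
    """Scalar Gaussian BP for P(x) proportional to exp(-x'Ax/2 + b'x).

    Returns the belief means and variances after n_iters parallel updates.
    """
    n = len(b)
    nbrs = [[j for j in range(n) if j != i and A[i, j] != 0] for i in range(n)]
    prec = np.zeros((n, n))  # prec[i, j]: precision of the message i -> j
    lin = np.zeros((n, n))   # lin[i, j]: precision-weighted mean of message i -> j
    for _ in range(n_iters):
        new_prec, new_lin = np.zeros_like(prec), np.zeros_like(lin)
        for i in range(n):
            for j in nbrs[i]:
                # Combine local evidence with all incoming messages except j's.
                p = A[i, i] + sum(prec[k, i] for k in nbrs[i] if k != j)
                h = b[i] + sum(lin[k, i] for k in nbrs[i] if k != j)
                # Integrating out x_i against exp(-A_ij x_i x_j) gives:
                new_prec[i, j] = -A[i, j] ** 2 / p
                new_lin[i, j] = -A[i, j] * h / p
        prec, lin = new_prec, new_lin
    belief_prec = np.array([A[i, i] + prec[nbrs[i], i].sum() for i in range(n)])
    belief_lin = np.array([b[i] + lin[nbrs[i], i].sum() for i in range(n)])
    return belief_lin / belief_prec, 1.0 / belief_prec

# Hypothetical loopy GMRF: edges 0-1, 0-2, 1-2, 1-3, 2-3 (two loops),
# diagonally dominant so that the iterations settle down.
A = np.array([[4.0, 0.8, 0.8, 0.0],
              [0.8, 4.0, 0.8, 0.8],
              [0.8, 0.8, 4.0, 0.8],
              [0.0, 0.8, 0.8, 4.0]])
b = np.array([1.0, -2.0, 0.5, 3.0])

bp_mean, bp_var = gaussian_bp(A, b, n_iters=100)
true_mean = np.linalg.solve(A, b)       # Eq. 10
true_var = np.diag(np.linalg.inv(A))    # Eq. 11
```

In this example the belief means can be compared directly with `true_mean`,
while the belief variances need not coincide with `true_var` because the graph
has loops.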

We  can  now  state  the  main  question  of  this  paper.  What  is  the  relationship  between 
the  true  posterior  means  and  covariances  (calculated  using  Eq.  10)  and  the  belief 
propagation  means  and  covariances  (calculated  using  the  belief  propagation  rules 
Eqs.  4-5)  ? 


Figure 2: Left: A Markov network with multiple loops. Right: The unwrapped
network corresponding to this structure. The unwrapped networks are constructed
by replicating the potentials $\Psi(x_i, x_j)$ and observations $y_i$ while
preserving the local connectivity of the loopy network. They are constructed so
that the messages received by node A after $t$ iterations in the loopy network
are equivalent to those that would be received by A in the unwrapped network.
An observed node, $y_i$, not shown, is connected to each depicted node.

2  Analysis 

To compare the correct posteriors and the loopy beliefs, we construct an
unwrapped tree. The unwrapped tree is the graphical model that the loopy belief
propagation is solving exactly when applying the belief propagation rules in a
loopy network [9, 24, 23]. In error-correcting codes, the unwrapped tree is
referred to as the “computation tree”. It is based on the idea that the
computation of a message sent by a node at time $t$ depends on messages it
received from its neighbors at time $t - 1$ and those messages depend on the
messages the neighbors received at time $t - 2$, etc.

To construct an unwrapped tree, set an arbitrary node, say $x_1$, to be the
root node and then iterate the following procedure $t$ times:

•  Find all leaves of the tree (start with the root).

•  For each leaf, find all $k$ nodes in the loopy graph that neighbor the node
corresponding to this leaf.

•  Add $k - 1$ nodes as children to each leaf, corresponding to all neighbors
except the parent node.

Each  node  in  the  loopy  graph  will  have  a  different  unwrapped  tree  with  that  node 
at  the  root. 
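The unwrapping procedure can be sketched in a few lines. The code below is an
illustration only: the function name `unwrap`, its return format and the
4-node graph with two loops are assumptions, and the graph is not claimed to be
the one drawn in Figure 2. Tree nodes are produced in breadth first order, and
the root, having no parent, receives all of its neighbors as children.

```python
def unwrap(neighbors, root, depth):
    """Build the unwrapped tree of a loopy graph in breadth first order.

    neighbors maps each loopy node to the list of its neighbors. Returns two
    lists indexed by tree node: the loopy node it replicates, and the index of
    its parent tree node (None for the root).
    """
    node_of = [root]
    parent_of = [None]
    leaves = [0]
    for _ in range(depth):
        new_leaves = []
        for leaf in leaves:
            parent = parent_of[leaf]
            parent_node = None if parent is None else node_of[parent]
            for nb in neighbors[node_of[leaf]]:
                if nb != parent_node:  # every neighbor except the parent
                    node_of.append(nb)
                    parent_of.append(leaf)
                    new_leaves.append(len(node_of) - 1)
        leaves = new_leaves
    return node_of, parent_of

# Hypothetical loopy graph with two loops.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}

node_of, parent_of = unwrap(neighbors, root="A", depth=2)
deeper_node_of, _ = unwrap(neighbors, root="A", depth=3)
```

Each additional iteration adds one level to the tree, so the same loopy node
appears as many replicas at increasing distance from the root.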

Figure  2  shows  an  unwrapped  tree  around  node  A  for  the  diamond  shaped  graph 
on  the  left.  Each  node  has  a  shaded  observed  node  attached  to  it  that  is  not 
shown  for  clarity.  Since  belief  propagation  is  exact  for  the  unwrapped  tree,  we  can 
calculate  the  beliefs  in  the  unwrapped  tree  by  using  the  marginalization  formulae 
for  Gaussians. 

We use a tilde for unwrapped quantities. We scan the tree in breadth first
order and denote by $\tilde{x}$ the vector of values in the hidden nodes of the
tree when scanned in this fashion. Similarly, we denote by $\tilde{y}$ the
observed nodes scanned in the same order. As before,
$\tilde{z} = \begin{pmatrix} \tilde{x} \\ \tilde{y} \end{pmatrix}$. To simplify
the notation, we assume throughout this section that all nodes are scalar
valued. In section 4 we generalize the analysis to vector valued nodes.

The basic idea behind our analysis is to relate the wrapped and unwrapped
inverse covariance matrices. By the nature of unwrapping, all elements
$\tilde{V}_{xy}(i, j)$ and $\tilde{y}(i)$ are copies of the corresponding
elements $V_{xy}(i', j')$ and $y(i')$ (where $\tilde{x}_i, \tilde{x}_j$ are
replicas of $x_{i'}, x_{j'}$). Also, all elements $\tilde{V}_{xx}(i, j)$ where
$i$ and $j$ are non-leaf nodes are copies of $V_{xx}(i', j')$. However, the
elements $\tilde{V}_{xx}(i, j)$ for the leaf nodes are not copies of
$V_{xx}(i', j')$ because leaf nodes are missing some neighbors.

Intuitively, we might expect that if all the equations that $\tilde{\mu}$
satisfies are copies of the equations that $\mu$ satisfies, then simply
creating $\tilde{\mu}$ by many copies of $\mu$ would give a valid solution in
the unwrapped network. However, because some of the equations are not copies,
this intuition does not explain why the means are exact in Gaussian networks.

An additional intuition, that we formalize below, is that the influence of the
non-copied equations (those at the leaf nodes) decreases with additional
iterations. As the number of iterations is increased, the distance between the
leaf nodes and the root node increases and their influence on the root node
decreases. When their influence goes to zero, the mean at the root node is
exact.

Although the elements $\tilde{V}_{xx}(i, j)$ are copies of $V_{xx}(i', j')$ for
the non-leaf nodes, the matrix $\tilde{V}_{xx}$ is not simply a block
replication of $V_{xx}$. The system of equations that defines $\tilde{\mu}$ is
a coupled system of equations. Hence the variance at the root node
$\tilde{V}_{xx}^{-1}(1, 1)$ differs from the correct variance
$V_{xx}^{-1}(1, 1)$.

In  the  following  section  we  prove  the  following  three  claims. 

Assume, without loss of generality, that the root node is $x_1$. Let
$\tilde{\mu}(1)$ and $\tilde{\sigma}^2(1)$ be the conditional mean and variance
at node 1 after $t$ iterations of loopy propagation. Let $\mu(1)$ and
$\sigma^2(1)$ be the correct conditional mean and variance of node 1. Let
$\tilde{C}_{x_1|y}$ be the conditional correlation of the root node with all
other nodes in the unwrapped tree. Then:

Claim 1:

$$\tilde{\mu}(1) = \mu(1) + \tilde{C}_{x_1|y}\, r \tag{12}$$

where $r$ is a vector that is zero everywhere but the last $L$ components
(corresponding to the leaf nodes).

Claim 2:

$$\tilde{\sigma}^2(1) = \sigma^2(1) + \tilde{C}_{x_1|y}\, r_1 - \tilde{C}_{x_1|y}\, r_2 \tag{13}$$

where $r_1$ is a vector that is zero everywhere but the last $L$ components and
$r_2$ is equal to 1 for all components corresponding to non-root nodes in the
unwrapped tree that reference $x_1$. All other components of $r_2$ are zero.

Claim 3: If the conditional correlation between the root node and the leaf
nodes decreases rapidly enough then (1) belief propagation converges, (2) the
belief propagation means are exact and (3) the belief propagation variances are
equal to the correct variances minus the summed conditional correlations
between $\tilde{x}_1$ and all $\tilde{x}_j$ that are replicas of $x_1$.

Figure 3: The conditional correlation between the root node and all other nodes
in the unwrapped tree for the diamond figure after 15 iterations. Potentials
were chosen randomly. Nodes are presented in breadth first order so the last
elements are the correlations between the root node and the leaf nodes. It can
be proven that if this correlation goes to zero then (1) belief propagation
converges, (2) the loopy means are exact and (3) the loopy variances equal the
correct variances minus the summed conditional correlation of the root node and
all other nodes that are replicas of the same loopy node. Symbols plotted with
a star denote correlations with nodes that correspond to the node A in the
loopy graph. It can be proven that the sum of these correlations gives the
correct variance of node A while loopy propagation uses only the first
correlation.

To obtain intuition, Fig. 3 shows $\tilde{C}_{x_1|y}$ for the diamond figure in
Fig. 2. We generated random potential functions and observations for the loopy
diamond figure and calculated the conditional correlations in the unwrapped
network. Note that the conditional correlation decreases with distance in the
tree; we are scanning in breadth first order so the last $L$ components
correspond to the leaf nodes. As the number of iterations of loopy propagation
is increased the size of the unwrapped tree increases and the conditional
correlation between the leaf nodes and the root node decreases.

From equations 12-13 it is clear that if the conditional correlation between
the leaf nodes and the root node is zero for all sufficiently large
unwrappings then (1) belief propagation converges, (2) the means are exact and
(3) the belief propagation variances are equal to the correct variances minus
the summed conditional correlations between $\tilde{x}_1$ and all
$\tilde{x}_j$ that are replicas of $x_1$. In practice the conditional
correlations will not actually be equal to zero for any finite unwrapping, so
claim 3 states this more precisely.

2.1  Relation of loopy and unwrapped quantities

The proof of all three claims relies on the relationship between the elements
of $y$, $V_{xy}$ and $V_{xx}$ and their unwrapped counterparts, described
below.

Each node in $\tilde{x}$ corresponds to a node in the original loopy network.
Let $O$ be a matrix that defines this correspondence: $O(i, j) = 1$ if
$\tilde{x}_i$ corresponds to $x_j$ and zero otherwise. Thus, in figure 2,
ordering the nodes alphabetically, the first rows [...]

Using $O$ we can formalize the relationship between the unwrapped quantities
and the original ones. The simplest one is $\tilde{y}$, that only contains
replicas of the original $y$:

$$\tilde{y} = O y \tag{15}$$

Since every $x_i$ is connected to a $y_i$, $V_{xy}$ and $\tilde{V}_{xy}$ are
zero everywhere but along their diagonals (the block diagonals, for vector
valued variables). The diagonal elements of $\tilde{V}_{xy}$ are simply
replications of those of $V_{xy}$, hence:

$$\tilde{V}_{xy} O = O V_{xy} \tag{16}$$

$\tilde{V}_{xx}$ also contains the elements of the original $V_{xx}$ but here
special care needs to be taken. Note that by construction, every node in the
interior of the unwrapped tree has exactly the same statistical relationship
with its neighbors as the corresponding node in the loopy graph. If a node in
the loopy graph has $k$ neighbors then a node in the unwrapped tree will have,
by construction, one parent and $k - 1$ children. The leaf nodes in the
unwrapped tree, however, will be missing the $k - 1$ children and hence will
not have the same number of neighbors. Thus, for all nodes
$\tilde{x}_i, \tilde{x}_j$ that are not leaf nodes, $\tilde{V}_{xx}(i, j)$ is a
copy of the corresponding $V_{xx}(k, l)$, where unwrapped nodes $i$ and $j$
refer to loopy nodes $k$ and $l$, respectively.

Therefore: 

$$\tilde{V}_{xx} O + E = O V_{xx} \tag{17}$$

where $E$ is an error matrix. $E$ is zero for all non-leaf nodes so the first
$N - L$ rows of $E$ are zero.

2.2  Proof of claim 1

The marginalization equation for the unwrapped problem gives:

$$\tilde{V}_{xx} \tilde{\mu} = -\tilde{V}_{xy} \tilde{y} \tag{18}$$

Substituting Eqs. 15 and 16, relating loopy and unwrapped network quantities,
into Eq. 18, for the unwrapped posterior mean, gives:

$$\tilde{V}_{xx} \tilde{\mu} = -\tilde{V}_{xy} O y = -O V_{xy} y \tag{19}$$


For the true means, $\mu$, of the loopy network, we have

$$V_{xx} \mu = -V_{xy} y \tag{20}$$

To relate that to the means of the unwrapped network, we left-multiply by $O$:

$$O V_{xx} \mu = -O V_{xy} y \tag{21}$$

Using Eq. 17, relating $V_{xx}$ to $\tilde{V}_{xx}$, we have

$$\tilde{V}_{xx} O \mu + E \mu = -O V_{xy} y \tag{22}$$

Comparing Eqs. 22 and 19 gives

$$\tilde{V}_{xx} O \mu + E \mu = \tilde{V}_{xx} \tilde{\mu} \tag{23}$$

or:

$$\tilde{\mu} = O \mu + \tilde{V}_{xx}^{-1} E \mu \tag{24}$$

Using Eq. 11

$$\tilde{\mu} = O \mu + \tilde{C}_{x|y} E \mu \tag{25}$$

The left and right hand sides of equation 25 are column vectors. We take the
first component of both sides and get:

$$\tilde{\mu}(1) = \mu(1) + \tilde{C}_{x_1|y} E \mu \tag{26}$$

Since $E$ is zero in the first $N - L$ rows, $E \mu$ is zero in the first
$N - L$ components. □

2.3  Proof  of  claim  2 

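From Eq. 11,

$$V_{xx} C_{x|y} = I \tag{27}$$

Taking the first column of this equation gives:

$$V_{xx} C^T_{x_1|y} = e_1 \tag{28}$$

where $e_1(1) = 1$, $e_1(j > 1) = 0$.

Using the same strategy as in the previous proof, we left-multiply by $O$:

$$O V_{xx} C^T_{x_1|y} = O e_1 \tag{29}$$

and similarly we substitute equation 17:

$$\tilde{V}_{xx} O C^T_{x_1|y} + E C^T_{x_1|y} = O e_1 \tag{30}$$

The analog of equation 28 in the unwrapped problem is:

$$\tilde{V}_{xx} \tilde{C}^T_{x_1|y} = \tilde{e}_1 \tag{31}$$

where $\tilde{e}_1(1) = 1$, $\tilde{e}_1(j > 1) = 0$.

Subtracting Eqs. 30 and 31 and rearranging terms gives:

$$\tilde{C}^T_{x_1|y} = O C^T_{x_1|y} + \tilde{V}_{xx}^{-1} E C^T_{x_1|y} + \tilde{V}_{xx}^{-1} (\tilde{e}_1 - O e_1) \tag{32}$$

Again, we take the first row of both sides of equation 32 and use the fact that
the first row of $\tilde{V}_{xx}^{-1}$ is $\tilde{C}_{x_1|y}$ to obtain:

$$\tilde{\sigma}^2(1) = \sigma^2(1) + \tilde{C}_{x_1|y} E C^T_{x_1|y} + \tilde{C}_{x_1|y} (\tilde{e}_1 - O e_1) \tag{33}$$

Again, since $E$ is zero in the first $N - L$ rows, $E C^T_{x_1|y}$ is zero in
the first $N - L$ components. □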
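The relations used in the two proofs can be checked numerically on a small
example. The sketch below builds an unwrapped tree for a hypothetical 4-node
loopy GMRF (the matrices, the observations, the depth and all variable names
are assumptions for illustration), forms $O$, the unwrapped matrices and
$E = O V_{xx} - \tilde{V}_{xx} O$ from Eq. 17, and evaluates both sides of
Eqs. 26 and 33. Tree nodes and loopy nodes are indexed from 0, so index 0
plays the role of node 1 in the text.

```python
import numpy as np

def unwrap(nbrs, root, depth):
    """Breadth first unwrapping; returns replicated loopy node and parent index."""
    node_of, parent_of, leaves = [root], [None], [0]
    for _ in range(depth):
        new_leaves = []
        for leaf in leaves:
            p = parent_of[leaf]
            skip = None if p is None else node_of[p]
            for nb in nbrs[node_of[leaf]]:
                if nb != skip:
                    node_of.append(nb)
                    parent_of.append(leaf)
                    new_leaves.append(len(node_of) - 1)
        leaves = new_leaves
    return node_of, parent_of

# Hypothetical loopy GMRF with two loops and a diagonal V_xy.
Vxx = np.array([[4.0, 0.8, 0.8, 0.0],
                [0.8, 4.0, 0.8, 0.8],
                [0.8, 0.8, 4.0, 0.8],
                [0.0, 0.8, 0.8, 4.0]])
Vxy = -np.eye(4)
y = np.array([1.0, -2.0, 0.5, 3.0])
nbrs = [[j for j in range(4) if j != i and Vxx[i, j] != 0] for i in range(4)]

node_of, parent_of = unwrap(nbrs, root=0, depth=6)
N = len(node_of)

# Correspondence matrix O: O[i, j] = 1 if tree node i replicates loopy node j.
O = np.zeros((N, 4))
O[np.arange(N), node_of] = 1.0

# Unwrapped matrices: copy diagonal entries and the entries along tree edges.
tVxx = np.zeros((N, N))
for i, (k, p) in enumerate(zip(node_of, parent_of)):
    tVxx[i, i] = Vxx[k, k]
    if p is not None:
        tVxx[i, p] = tVxx[p, i] = Vxx[k, node_of[p]]
tVxy = np.diag(np.diag(Vxy)[node_of])
ty = O @ y                                   # Eq. 15

E = O @ Vxx - tVxx @ O                       # Eq. 17
is_leaf = np.ones(N, dtype=bool)
is_leaf[[p for p in parent_of if p is not None]] = False
L = int(is_leaf.sum())

mu = np.linalg.solve(Vxx, -Vxy @ y)          # Eq. 20
C = np.linalg.inv(Vxx)                       # Eq. 11
tmu = np.linalg.solve(tVxx, -tVxy @ ty)      # Eq. 18
tC = np.linalg.inv(tVxx)

e1, te1 = np.eye(4)[0], np.eye(N)[0]
mean_rhs = mu[0] + tC[0] @ E @ mu                                  # Eq. 26
var_rhs = C[0, 0] + tC[0] @ E @ C[0] + tC[0] @ (te1 - O @ e1)      # Eq. 33
```

Here `tmu[0]` and `tC[0, 0]` are the root mean and variance in the unwrapped
tree, to be compared with `mean_rhs` and `var_rhs`.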

2.4  Proof  of  claim  3 

Here we need to define what we mean by “rapidly enough”. We restate the claim
precisely.

Suppose for every $\epsilon$ there exists a $t_\epsilon$ such that for all
$t > t_\epsilon$, $|\tilde{C}_{x_1|y}\, r| < \epsilon \max_i |r(i)|$ for any
vector $r$ that is nonzero only in the last $L$ components (those corresponding
to the leaf nodes). In this case, (1) belief propagation converges, (2) the
means are exact and (3) the variances are equal to the correct variances minus
the summed conditional correlations between $\tilde{x}_1$ and all non-root
$\tilde{x}_j$ that are replicas of $x_1$.

This claim follows from the first two claims. The only thing to show is that
$E \mu$ and $E C^T_{x_1|y}$ are bounded for all iterations. This is true
because the rows of $E$ are bounded and $\mu$, $C_{x_1|y}$ do not depend on the
iteration. □

How wrong will the variances be? The term $\tilde{C}_{x_1|y}\, r_2$ in Eq. 13
is simply the sum of many components of $\tilde{C}_{x_1|y}$. Figure 3 shows
these components. The correct variance is the sum of all the components while
the loopy variance approximates this sum with the first (and dominant) term.

Note that when the conditional correlation decreases rapidly to zero two things
happen. First, the convergence is faster (because $\tilde{C}_{x_1|y}\, r_1$
approaches zero faster). Second, the approximation error of the variances is
smaller (because $\tilde{C}_{x_1|y}\, r_2$ is smaller). Thus, as in the single
loop case, we find that quick convergence is correlated with good
approximation.

In  practice,  it  may  be  difficult  to  check  whether  the  conditional  correlations  decrease 
rapidly  enough.  In  the  next  section  we  show  that  if  loopy  propagation  converges 
then  the  loopy  means  are  exact. 

3  Fixed  points  of  loopy  propagation 

Each iteration of belief propagation can be thought of as an operator $F$ that
inputs a list of messages $m^{(t)}$ and outputs a list of messages
$m^{(t+1)} = F m^{(t)}$. Thus belief propagation can be thought of as an
iterative way of finding a solution to the fixed point equations $F m = m$ with
an initial guess $m^0$ in which all messages are constant functions.


Note that this is not the only way of finding fixed-points. McEliece et al.
[16] have shown a simple example for which $F$ contains multiple fixed points
and belief propagation finds only one. They also showed a simple example where
a fixed-point exists but the iterations $m \leftarrow F m$ do not converge. An
alternative way of finding fixed-points of $F$ is described in [17].

In this section we ask: suppose a fixed-point $m^* = F m^*$ has been found by
some method, how are the beliefs calculated based on these messages related to
the correct beliefs?

Claim 4: For a Gaussian graphical model of arbitrary topology, if $m^*$ is a
fixed-point of the message-passing dynamics then the means based on that
fixed-point are exact.

The  proof  is  based  on  the  following  lemma: 

Periodic beliefs lemma: If m* is a fixed-point of the message-passing dynamics
in a graphical model G then one can construct a modified unwrapped tree T of
arbitrarily large depth such that: (1) all non-leaf nodes in T have the same statistical
relationship with their neighbors as the corresponding nodes in G and (2) all nodes
in T will have the same belief as the beliefs in G derived from m*.

Proof: The proof is by construction. We first construct an unwrapped tree T of
the desired depth. We then modify the potentials and the observations in the leaf
nodes in the following manner. For each leaf node $\tilde{x}_i$, find the k - 1 nodes in G that
neighbor $x_{i'}$ (where $\tilde{x}_i$ is a replica of $x_{i'}$), excluding the parent of $\tilde{x}_i$. Calculate the
product of the k - 1 messages that these neighbors send to the corresponding node
in G under the fixed-point messages m* and the message that $y_{i'}$ sends to $x_{i'}$. Set
$\tilde{y}_i$ and the potential between $\tilde{y}_i$ and $\tilde{x}_i$ such that the message $\tilde{y}_i$ sends to $\tilde{x}_i$ is
equal to this product.

By this construction, all leaf nodes in T will send their neighbors a message from
m*. Since all non-leaf nodes in T have the same statistical relationship to their
neighbors as the corresponding nodes in G, the local message passing updates in T
are identical to those in G. Thus all messages in T will be replicas of messages in
m*. □
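The unwrapped tree used in the lemma, and the replica matrix O that relates tree nodes to nodes of G, can be sketched as follows. This is a minimal illustration only: it builds the plain unwrapped tree (it does not modify the leaf potentials), and the function names `unwrap` and `replica_matrix` are hypothetical. Nodes are generated breadth-first, so the leaves end up as the last entries.

```python
from collections import deque


def unwrap(adj, root, depth):
    """Breadth-first unwrapping of a loopy graph G into a tree.

    adj maps each node of G to a list of its neighbors. Returns two parallel
    lists: replica_of[t] is the node of G that tree node t copies, and
    parent[t] is the index of its parent in the tree (None for the root).
    """
    replica_of, parent, level = [root], [None], [0]
    queue = deque([0])
    while queue:
        t = queue.popleft()
        if level[t] == depth:
            continue  # leaves are not expanded further
        up = None if parent[t] is None else replica_of[parent[t]]
        for v in adj[replica_of[t]]:
            if v == up:
                continue  # every neighbor except the one we came from
            replica_of.append(v)
            parent.append(t)
            level.append(level[t] + 1)
            queue.append(len(replica_of) - 1)
    return replica_of, parent


def replica_matrix(replica_of, nodes):
    """O[t][j] = 1 if tree node t is a replica of nodes[j], and 0 otherwise."""
    col = {v: j for j, v in enumerate(nodes)}
    O = [[0] * len(nodes) for _ in replica_of]
    for t, v in enumerate(replica_of):
        O[t][col[v]] = 1
    return O


# Triangle graph: a single loop through nodes 0, 1, 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
replica_of, parent = unwrap(adj, root=0, depth=3)
O = replica_matrix(replica_of, nodes=[0, 1, 2])
```

For the triangle unwrapped to depth 3 around node 0, the tree has seven nodes, of which the last two are leaves that are again replicas of node 0.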

Proof of Claim 4: Using this lemma we can prove Claim 4. Let $\tilde{\mu}$ be the conditional
mean in the modified unwrapped tree. Then, by the periodic beliefs lemma:

$$\tilde{\mu} = O \mu_0 \qquad (34)$$

where $\mu_0(i)$ is the posterior mean at node i under m*.

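We also know that $\tilde{\mu}$ is a solution to:

$$\tilde{V}_{xx} \tilde{\mu} = -\tilde{V}_{xy} \tilde{y} \qquad (35)$$

where $\tilde{V}_{xx}$, $\tilde{V}_{xy}$, $\tilde{y}$ refer to quantities in the modified unwrapped tree. So:

$$\tilde{V}_{xx} O \mu_0 = -\tilde{V}_{xy} \tilde{y} \qquad (36)$$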


We use the notation $[A]_m$ to indicate taking the first m rows of a matrix A. Note
that for any two matrices $[AB]_m = [A]_m B$. Taking the first m rows of equation 36
gives:

$$[\tilde{V}_{xx} O]_m \mu_0 = -[\tilde{V}_{xy} \tilde{y}]_m \qquad (37)$$


As in the previous proofs, the key idea is to relate the inverse covariance matrix of
the modified unwrapped tree to that of the original loopy graph. Since all non-leaf
nodes in the modified unwrapped tree have the same neighborhood relationships
with their neighbors as the corresponding nodes in the loopy graph we have, for
any m < N - L:

$$[\tilde{V}_{xx} O]_m = [O V_{xx}]_m \qquad (38)$$

and:

$$[\tilde{V}_{xy} \tilde{y}]_m = [O V_{xy} y]_m \qquad (39)$$

Substituting these relationships into equation 37 gives:

$$[O]_m V_{xx} \mu_0 = -[O]_m V_{xy} y \qquad (40)$$

This equation holds for any m < N - L. Since we can unwrap the tree to arbitrarily
large size we can choose m such that $[O]_m$ has n independent rows (this happens
once all nodes in the loopy graph appear at least once in the modified unwrapped
tree). Thus:

$$V_{xx} \mu_0 = -V_{xy} y \qquad (41)$$

hence the means derived from the fixed-point messages are exact. □
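Claim 4 can be checked numerically on a small example. The sketch below runs scalar Gaussian belief propagation on a hypothetical four-node graph with several loops and compares the result with a direct solve. It assumes the posterior is written as exp(-x'Ax/2 + b'x), so that `A` plays the role of V_xx and `b` the role of -V_xy y; the matrix entries, the parameterization of messages by (precision, information), and the function names are choices made here, not taken from the text.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gaussian_bp(A, b, iters=200):
    """Loopy BP for exp(-x'Ax/2 + b'x); returns (means, variances)."""
    n = len(A)
    nbrs = [[j for j in range(n) if j != i and A[i][j] != 0.0] for i in range(n)]
    # Message i -> j is a Gaussian in x_j stored as (precision, information).
    # All-zero entries correspond to constant (uninformative) messages.
    msgs = {(i, j): (0.0, 0.0) for i in range(n) for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            prec = A[i][i] + sum(msgs[(k, i)][0] for k in nbrs[i] if k != j)
            info = b[i] + sum(msgs[(k, i)][1] for k in nbrs[i] if k != j)
            new[(i, j)] = (-A[i][j] ** 2 / prec, -A[i][j] * info / prec)
        msgs = new
    means, variances = [], []
    for i in range(n):
        prec = A[i][i] + sum(msgs[(k, i)][0] for k in nbrs[i])
        info = b[i] + sum(msgs[(k, i)][1] for k in nbrs[i])
        means.append(info / prec)
        variances.append(1.0 / prec)
    return means, variances


# Hypothetical diagonally dominant model with loops 0-1-2, 0-2-3 and 0-1-2-3.
A = [[1.5, -0.3, -0.3, -0.3],
     [-0.3, 1.5, -0.3, 0.0],
     [-0.3, -0.3, 1.5, -0.3],
     [-0.3, 0.0, -0.3, 1.5]]
b = [1.0, -2.0, 0.5, 3.0]

bp_means, bp_vars = gaussian_bp(A, b)
exact_means = solve(A, b)
# True variances are the diagonal of the inverse of A.
true_vars = [solve(A, [1.0 if k == i else 0.0 for k in range(4)])[i]
             for i in range(4)]
```

In this example the loopy means agree with the direct solve to numerical precision, while the loopy variances come out smaller than the true ones.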

3.1  Non-Gaussian  variables 

In Sect. 1 we described the “max-product” belief propagation algorithm that finds
the MAP estimate for each node [18, 23] of a network without loops. As with
sum-product, iterating this algorithm is a method of finding a fixed-point of the message
passing dynamics. How does the assignment derived from this fixed-point compare
with the MAP assignment?

Claim 5: For a graphical model of arbitrary topology with continuous potential
functions, if m* is a fixed-point of the max-product message-passing dynamics then
the assignment based on that fixed-point is a local maximum of the posterior
probability.

Since the posterior probability factorizes into a product of pairwise potentials, the
log posterior will have the form:

$$\log P(x|y) = \sum_{ij} J_{ij}(x_i, x_j) + \sum_i J_{ii}(x_i, y_i) \qquad (42)$$


Assuming the clique potential functions are differentiable and finite, the MAP
solution, u, will satisfy

$$\frac{\partial}{\partial x_i} \log P(x|y) \Big|_{x=u} = 0 \qquad (43)$$

We will write this as:

$$V u = 0 \qquad (44)$$

where V is a nonlinear operator.

As in the previous section, we can use the periodic beliefs lemma to construct a
modified unwrapped tree of arbitrary size based on m*. If we denote by $\tilde{V}$ the
nonlinear set of equations that the solution $\tilde{u}$ to the modified unwrapped problem
must satisfy we have:

$$\tilde{V} \tilde{u} = 0 \qquad (45)$$

Because of the periodic beliefs lemma:

$$\tilde{u} = O u_0 \qquad (46)$$

where $u_0$ denotes the assignment based on m*.


Similarly, as in the previous section, all the non-leaf nodes will have the same
statistical relationship with their neighbors as do the corresponding nodes in the
loopy network, so:

$$[\tilde{V} O]_m = [O V]_m \qquad (47)$$

where the left and right hand sides are nonlinear operators.

Substituting Eqs. 46 and 47 into Eq. 45 gives:

$$V u_0 = 0 \qquad (48)$$

A similar substitution can be made with the second derivative equations to show
that the Hessian of the negative log posterior at $u_0$ is positive definite. Thus the
assignment based on m* is at least a local maximum of the posterior. □
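As a consistency check, the Gaussian model of the previous section can be read as a special case of this argument. The derivation below is a sketch that assumes the posterior is normalized as $P(x|y) \propto \exp(-\frac{1}{2} x^T V_{xx} x - x^T V_{xy} y)$, a convention chosen here to agree with Eq. 35 rather than quoted from the text.

```latex
\begin{align*}
\log P(x \mid y) &= -\tfrac{1}{2}\, x^{T} V_{xx}\, x - x^{T} V_{xy}\, y + \text{const}, \\
\nabla_x \log P(x \mid y)\,\big|_{x=u} &= -V_{xx}\, u - V_{xy}\, y = 0
\quad\Longrightarrow\quad V_{xx}\, u = -V_{xy}\, y, \\
\nabla_x^2 \bigl(-\log P(x \mid y)\bigr) &= V_{xx} .
\end{align*}
```

Under this assumption the stationarity condition is linear and coincides with the linear system for the means obtained at the end of the proof of Claim 4. The Hessian of the negative log posterior is $V_{xx}$, which is positive definite for a proper Gaussian, so the local maximum is also the unique global one.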

4  Vector  valued  nodes 

All of the results we have derived so far hold for vector-valued nodes as well, but
the indexing notation is slightly more cumbersome. We use a stacking convention,
in which we define the vector x by:

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \end{pmatrix} \qquad (49)$$

Thus, supposing $x_1$ is a vector of length 2, $x(1)$ is the first component of $x_1$ and
$x(2)$ is the second component of $x_1$ (not $x_2$). We define y in a similar fashion.

Using this stacking notation the equations for exact inference in Gaussians remain
unchanged, but we need to be careful in reading out the posterior means and
covariances from the stacked vectors. Thus we can still complete the square in stacked
notation to obtain:

$$V_{xx} \mu = -V_{xy} y \qquad (50)$$

and $C_{x|y} = V_{xx}^{-1}$. Assuming $x_1$ is of length 2, the posterior mean of $x_1$ is given
by:

$$\mu_1 = \begin{pmatrix} \mu(1) \\ \mu(2) \end{pmatrix} \qquad (51)$$

and the posterior covariance matrix $\Sigma_1$ is given by:

$$\Sigma_1 = \begin{pmatrix} C_{x|y}(1,1) & C_{x|y}(1,2) \\ C_{x|y}(2,1) & C_{x|y}(2,2) \end{pmatrix} \qquad (52)$$


We use the same stacked notation for $\tilde{x}$ and define the matrix O such that O(i, j) = 1
if $\tilde{x}(i)$ is a replica of $x(j)$ and zero otherwise. Using this notation, the relationships
between unwrapped and loopy quantities (e.g. $[\tilde{V}_{xx} O]_m = [O V_{xx}]_m$) still hold.
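The read-out step for vector-valued nodes can be sketched in a few lines. The helper names `block_slices` and `read_out` and all numbers below are hypothetical, and indices are 0-based, so node 0 in the code corresponds to $x_1$ in the text.

```python
def block_slices(dims):
    """For each node, the (start, stop) range of its components in the stacked vector."""
    slices, start = [], 0
    for d in dims:
        slices.append((start, start + d))
        start += d
    return slices


def read_out(mu, C, dims, node):
    """Return the posterior mean vector and covariance block of one node."""
    lo, hi = block_slices(dims)[node]
    mean = mu[lo:hi]
    cov = [row[lo:hi] for row in C[lo:hi]]
    return mean, cov


# Two nodes: the first has length 2 and the second has length 1,
# so the stacked vector x has three scalar components.
dims = [2, 1]
mu = [0.5, -1.0, 2.0]          # made-up stacked posterior mean
C = [[1.0, 0.2, 0.1],
     [0.2, 2.0, 0.3],
     [0.1, 0.3, 3.0]]          # made-up stacked posterior covariance

mu_1, Sigma_1 = read_out(mu, C, dims, node=0)
```

The mean of the first node is the first two entries of the stacked mean, and its covariance is the upper-left 2 x 2 block of the stacked covariance, which is the pattern described by the read-out equations above.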

Thus all the analysis done in the previous sections holds; the only difference is in
the semantics of quantities such as $\mu(1)$, which need to be understood as a scalar
component of a (possibly) larger vector $\mu_1$. For explicitness, we restate the five
claims for vector valued nodes.

For any i, j less than or equal to the number of components in $x_1$ we have:

Claim 1a:

$$\tilde{\mu}(i) = \mu(i) + \tilde{C}_{x_i|y} r \qquad (53)$$

where r is a vector that is zero everywhere but the last L components (corresponding
to the leaf nodes).

Claim 2a:

$$\tilde{C}_{x|y}(i,j) = C_{x|y}(i,j) + \tilde{C}_{x_j|y} r_1 - \tilde{C}_{x_j|y} r_2 \qquad (54)$$

where $r_1$ is a vector that is zero everywhere but the last L components (corresponding
to the leaf nodes) and $r_2$ is equal to 1 for all components corresponding to
non-root nodes in the unwrapped tree that reference x(i). All other components of
$r_2$ are zero.

Claim 3a: If the conditional correlation between all components of the root node
and the leaf nodes decreases rapidly enough then (1) belief propagation converges,
(2) the belief propagation means are exact and (3) the i, j component of the belief
propagation covariance matrices is equal to the i, j component of the true covariance
matrices minus the summed conditional correlations between x(j) and all non-root
$\tilde{x}(k)$ that are replicas of x(i).

Claim 4a: For a (possibly vector-valued) Gaussian graphical model of arbitrary
topology, if m* is a fixed-point of the message-passing dynamics, then the means
based on that fixed-point are exact.

Claim 5a: For a (possibly vector-valued) graphical model of arbitrary topology
with continuous potential functions, if m* is a fixed-point of the max-product
message-passing dynamics, then the assignment based on that fixed-point is a local
maximum of the posterior probability.

We emphasize that these claims do not need to be reproved: all the equations
used in proving the scalar-valued case still hold; only the semantics we place on the
individual components are different.

We  end  this  analysis  with  two  simple  corollaries: 

Corollary 1: Let m* be a fixed-point of Pearl’s belief propagation algorithm on
a Gaussian Bayesian network of arbitrary topology and arbitrary clique size. Then
the means based on m* are exact.

Corollary 2: Let m* be a fixed-point of Pearl’s belief revision (max-product)
algorithm on a Bayesian network with continuous joint probability, arbitrary topology
and arbitrary clique size. The assignment based on m* is at least a local maximum
of the posterior probability.

These corollaries follow from claims 4a and 5a along with the equivalence between
Pearl’s propagation rules and the propagation rules for pairwise undirected graphical
models analyzed here [23]. Note that even if the Bayesian network contained only
scalar nodes, the conversion to pairwise cliques may necessitate using vector-valued
nodes.

Figure 4: (a) 25 x 25 graphical model for simulation. The unobserved nodes
(unfilled) were connected to their four nearest neighbors and to an observation node
(filled). (b) The error of the estimates of loopy propagation and successive
over-relaxation (SOR) as a function of iteration. Note that belief propagation converges
much faster than SOR.

5  Simulations 

We ran belief propagation on a 25 x 25 2D grid. The joint probability was:

$$P(x,y) = \exp\Big(-\sum_{ij} w_{ij}(x_i - x_j)^2 - \sum_i w_{ii}(x_i - y_i)^2\Big) \qquad (55)$$

where $w_{ij} = 0$ if nodes $x_i$, $x_j$ are not neighbors and 0.01 otherwise, and $w_{ii}$ was
randomly selected to be 0 or 1 for all i, with the probability of 1 set to 0.2. The
observations $y_i$ were chosen randomly. This problem corresponds to an approximation
problem from sparse data where only 20% of the points are visible.

We found the exact posterior by solving Eq. 10. We also ran loopy belief propagation
and found that when it converged, the loopy means were identical to the true means
up to machine precision. Also, as explained by the theory, the loopy variances were
too small: the loopy estimate was overconfident.

In many applications, the solution of equation 10 by matrix inversion is intractable
and iterative methods are used. Figure 4 compares the error in the means as a
function of iterations for loopy propagation and successive over-relaxation (SOR),
considered one of the best relaxation methods [20]. Note that after five iterations
loopy propagation gives essentially the right answer while SOR requires many more.
As expected from the fast convergence, the approximation error in the variances was
quite small. The median error was 0.018. For comparison the true variances ranged
from 0.01 to 0.94 with a mean of 0.322. Also, the nodes for which the approximation
error was worst were indeed the nodes that converged most slowly.
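The experiment can be sketched on a smaller scale as follows. This is an illustration under several assumptions that are not in the text: a 7 x 7 grid instead of 25 x 25, a deterministic observation pattern (every fifth node) instead of random selection, made-up observations y_i = sin(i), each neighboring pair counted once in the first sum of Eq. 55, and an arbitrary SOR relaxation factor of 1.5. It makes no attempt to reproduce the curves of Figure 4.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def grid_model(n, w=0.01):
    """Write Eq. 55 as exp(-x'Ax/2 + b'x) on an n x n grid; returns (A, b)."""
    N = n * n
    A = [[0.0] * N for _ in range(N)]
    b = [0.0] * N
    for r in range(n):
        for c in range(n):
            i = r * n + c
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < n and cc < n:
                    j = rr * n + cc
                    A[i][j] = A[j][i] = -2.0 * w   # cross term of -w (x_i - x_j)^2
                    A[i][i] += 2.0 * w
                    A[j][j] += 2.0 * w
            if i % 5 == 0:                          # assumed pattern, about 20% observed
                A[i][i] += 2.0                      # w_ii = 1
                b[i] = 2.0 * math.sin(i)            # assumed observation y_i = sin(i)
    return A, b


def bp_errors(A, b, exact, iters):
    """Max abs error of the loopy BP means after each synchronous iteration."""
    n = len(A)
    nbrs = [[j for j in range(n) if j != i and A[i][j] != 0.0] for i in range(n)]
    msgs = {(i, j): (0.0, 0.0) for i in range(n) for j in nbrs[i]}
    errs = []
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            prec = A[i][i] + sum(msgs[(k, i)][0] for k in nbrs[i] if k != j)
            info = b[i] + sum(msgs[(k, i)][1] for k in nbrs[i] if k != j)
            new[(i, j)] = (-A[i][j] ** 2 / prec, -A[i][j] * info / prec)
        msgs = new
        err = 0.0
        for i in range(n):
            prec = A[i][i] + sum(msgs[(k, i)][0] for k in nbrs[i])
            info = b[i] + sum(msgs[(k, i)][1] for k in nbrs[i])
            err = max(err, abs(info / prec - exact[i]))
        errs.append(err)
    return errs


def sor_errors(A, b, exact, iters, omega=1.5):
    """Max abs error of successive over-relaxation after each sweep."""
    n = len(A)
    x = [0.0] * n
    errs = []
    for _ in range(iters):
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (1.0 - omega) * x[i] + omega * (b[i] - s) / A[i][i]
        errs.append(max(abs(x[i] - exact[i]) for i in range(n)))
    return errs


A, b = grid_model(7)
exact = solve(A, b)
bp_err = bp_errors(A, b, exact, iters=300)
sor_err = sor_errors(A, b, exact, iters=300)
```

Printing `bp_err` and `sor_err` side by side gives the two error curves for this toy problem. How they compare depends on the grid size, the observation pattern and the relaxation factor, so the sketch should be read as a template for the comparison rather than as evidence about it.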


The slow convergence of SOR on problems such as these led to the development
of  multi-resolution  models  in  which  the  MRF  is  approximated  by  a  tree  [14,  5]  and 
an  algorithm  equivalent  to  belief  propagation  is  then  run  on  the  tree.  Although 
the  multi-resolution  models  are  much  more  efficient  for  inference,  the  tree  structure 
often  introduces  block  artifacts  in  the  estimate.  Our  results  suggest  that  one  can 
simply  run  belief  propagation  on  the  original  MRF  and  get  the  exact  posterior 
means.  Although  the  posterior  variances  will  not  be  correct,  for  those  nodes  for 
which  loopy  propagation  converged  rapidly  the  approximation  error  will  be  small. 

6  Discussion 

Our  main  interest  in  analyzing  the  Gaussian  case  was  to  understand  the  performance 
of  belief  propagation  in  networks  with  multiple  loops.  Although  there  are  many 
special  properties  of  Gaussians,  we  are  struck  by  the  similarity  of  the  analytical 
results  reported  here  for  multi-loop  Gaussians  and  the  analytical  results  for  single 
loops  and  general  distributions  reported  in  [23].  The  most  salient  similarities  are: 

•  In single loop networks with binary nodes, the mode at each node is
guaranteed to be correct but the confidence in the mode may be incorrect. In
Gaussian networks with multiple loops the mean at each node is
guaranteed to be correct but the confidence around that mean will in general be
incorrect.

•  In single loop networks fast convergence is correlated with good
approximation of the beliefs. This is also true for Gaussian networks with multiple
loops.

•  In single loop networks the convergence rate and the approximation error
were determined by a ratio of eigenvalues $\lambda_1/\lambda_2$. This ratio determines
the extent of the statistical dependencies between the root and the leaf
nodes in the unwrapped network for a single loop. In Gaussian networks
the convergence rate and the approximation error are determined by the
off-diagonal terms of $C_{x|y}$. These terms quantify the extent of conditional
dependencies between the root nodes and the leaf nodes of the unwrapped
network.

These similarities are even more intriguing when one considers how different
Gaussian graphical models are from discrete models with arbitrary potentials and a
single loop. In Gaussians the conditional mean is equal to the conditional mode
and there is only one maximum in the posterior probability, while the single loop
discrete models may have multiple maxima, none of which will be equal to the mean.
Furthermore, in terms of approximate inference the two classes behave quite
differently. For example, mean field approximations are exact for Gaussian MRFs while
they work poorly in discrete networks with a single loop in which the connectivity
is sparse [21]. The resemblance of the results for Gaussian graphical models and
for single loops leads us to believe that similar results may hold for a larger class of
networks.

The sum-product and max-product belief propagation algorithms are appealing,
fast and easily parallelizable algorithms. Due to the well known hardness of
probabilistic inference in graphical models, belief propagation will obviously not work
for arbitrary networks and distributions. Nevertheless, there is a growing body of
empirical evidence showing its success in many loopy networks. Our results give a
theoretical justification for applying belief propagation in networks with multiple
loops. This may enable fast, approximate probabilistic inference in a range of new
applications.